The data set wine-quality-white-and-red.csv contains information from physicochemical and sensory tests performed on the white and red variants of the Portuguese “Vinho Verde” wine. It can be found in the UCI Machine Learning Repository.
It contains the following variables:
Categorical or sensory values:
Numerical or physicochemical tests:
The data set consists of 6497 observations of 13 different variables:
## type fixed.acidity volatile.acidity citric.acid
## red :1599 Min. : 3.800 Min. :0.0800 Min. :0.0000
## white:4898 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median : 7.000 Median :0.2900 Median :0.3100
## Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00 Min. : 6.0
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0
## Median : 3.000 Median :0.04700 Median : 29.00 Median :118.0
## Mean : 5.443 Mean :0.05603 Mean : 30.53 Mean :115.7
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0
## Max. :65.800 Max. :0.61100 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50
## Median :0.9949 Median :3.210 Median :0.5100 Median :10.30
## Mean :0.9947 Mean :3.219 Mean :0.5313 Mean :10.49
## 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30
## Max. :1.0390 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.818
## 3rd Qu.:6.000
## Max. :9.000
The data set consists primarily of white wine. Of the 6497 observations, 4898 are of white wine, which represents about 75% of the total data set.
Figure 2.1: Wine type distribution
When we analyze the density of each of the continuous variables, we see that all of them are right-skewed, meaning a high concentration of points at the lower values of the variable, with a long right tail. This can also be read as a signal of outliers with high values. Another interesting observation is that most of them appear bimodal, which might be due to a different mode for each of the 2 types of wine.
Figure 2.2: Numerical variable densities
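The right-hand skew described above can also be checked numerically with the sample skewness (positive values indicate a right-skewed distribution). A minimal sketch in R, using simulated data in place of the wine variables:

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
# Positive values indicate right skew, negative values left skew.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
skewness(rexp(10000))   # exponential data: right-skewed, clearly positive
skewness(rnorm(10000))  # normal data: symmetric, close to zero
```

In the report itself one would apply this function to each numeric column, e.g. `sapply(wine[, -1], skewness)` (assuming, hypothetically, a data frame called `wine` with the type in its first column).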
In the following plot we see the same densities as above, but split by type of wine. We can group the variables by their behavior for each type of wine as follows:
Figure 2.3: Numerical variable densities by type of wine
Now, let’s see if there’s any kind of relationship between every possible pair of variables.
Next, let’s see if there’s any difference in overall behavior between the two types of wine. For this, we’ll use the PCP and Andrews plots.
The Parallel Coordinates Plot (PCP) can be useful for finding highly correlated variables and distinct group behaviors. In our case we have so many observations that identifying correlated variables by eye is nearly impossible. On the other hand, we can see that both groups behave quite similarly in most of the variables, but not in all of them. White wine covers a wider range of values for free.sulfur.dioxide, total.sulfur.dioxide and residual.sugar, while red wine has a small group of observations (possible outliers?) in the chlorides variable and seems to reach higher values than white wine for sulphates and volatile.acidity. Even so, there isn’t a clear cut between the two types of wine.
That conclusion is also reinforced by the Andrews plot in the lower part of the figure, in which we graph the finite Fourier series defined by each observation. Here, the red wine curves still fall mostly within the range spanned by the white wine curves.
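For reference, the Andrews plot maps each observation \(x = (x_1, \dots, x_p)\) to the curve \(f_x(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \dots\). A small sketch of that mapping (the plotting itself is left out):

```r
# Andrews curve of one observation:
# f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ...
andrews_curve <- function(x, t) {
  val <- x[1] / sqrt(2)
  for (i in seq_along(x)[-1]) {
    k <- i %/% 2                      # frequency grows every two coordinates
    term <- if (i %% 2 == 0) sin(k * t) else cos(k * t)
    val <- val + x[i] * term
  }
  val
}

t_grid <- seq(-pi, pi, length.out = 101)
y <- andrews_curve(c(1, 2, 3), t_grid)  # one curve, ready to draw with lines()
```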
Below we have the mean vector (the mean of each variable) for the whole data set and for each type of wine. Some genuinely interesting facts stand out. For example, the mean residual.sugar is quite different between the two wines: white wine has, on average, about 2.5 times more sugar than red wine. The two sulfur dioxide variables, free.sulfur.dioxide & total.sulfur.dioxide, follow the same pattern, with white wine showing roughly x2 and x3 the average red wine values, respectively. Conversely, red wine has considerably higher values of volatile.acidity (~x2) and chlorides (~x2).
As for the rest of the variables, there is no considerable difference between the two subgroups.
| | All | Red | White |
|---|---|---|---|
| fixed.acidity | 7.2153 | 8.3196 | 6.8548 |
| volatile.acidity | 0.3397 | 0.5278 | 0.2782 |
| citric.acid | 0.3186 | 0.2710 | 0.3342 |
| residual.sugar | 5.4432 | 2.5388 | 6.3914 |
| chlorides | 0.0560 | 0.0875 | 0.0458 |
| free.sulfur.dioxide | 30.5253 | 15.8749 | 35.3081 |
| total.sulfur.dioxide | 115.7446 | 46.4678 | 138.3607 |
| density | 0.9947 | 0.9967 | 0.9940 |
| pH | 3.2185 | 3.3111 | 3.1883 |
| sulphates | 0.5313 | 0.6581 | 0.4898 |
| alcohol | 10.4918 | 10.4230 | 10.5143 |
| quality | 5.8184 | 5.6360 | 5.8779 |
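Mean vectors like the ones above can be produced with `colMeans` and `aggregate`. A sketch using the built-in `iris` data as a stand-in for the wine data set (the `Species` column plays the role of `type`):

```r
# Overall mean vector over the numeric columns
overall <- colMeans(iris[, 1:4])

# Mean vector per group (here Species stands in for the wine type)
by_group <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
```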
Next, we have the covariance matrices for the whole data set and for each type of wine (red first, then white). Most entries are close to or exactly 0, which suggests that most pairs of variables are only weakly related, keeping in mind that covariance is scale-dependent, so small entries can also reflect variables measured on small scales. Still, we do have some noticeable exceptions:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.68 0.05 0.06 -0.69
## volatile.acidity 0.05 0.03 -0.01 -0.15
## citric.acid 0.06 -0.01 0.02 0.10
## residual.sugar -0.69 -0.15 0.10 22.64
## chlorides 0.01 0.00 0.00 -0.02
## free.sulfur.dioxide -6.51 -1.03 0.34 34.02
## total.sulfur.dioxide -24.11 -3.86 1.60 133.24
## density 0.00 0.00 0.00 0.01
## pH -0.05 0.01 -0.01 -0.20
## sulphates 0.06 0.01 0.00 -0.13
## alcohol -0.15 -0.01 0.00 -2.04
## quality -0.09 -0.04 0.01 -0.15
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity 0.01 -6.51 -24.11 0.00
## volatile.acidity 0.00 -1.03 -3.86 0.00
## citric.acid 0.00 0.34 1.60 0.00
## residual.sugar -0.02 34.02 133.24 0.01
## chlorides 0.00 -0.12 -0.55 0.00
## free.sulfur.dioxide -0.12 315.04 723.26 0.00
## total.sulfur.dioxide -0.55 723.26 3194.72 0.01
## density 0.00 0.00 0.01 0.00
## pH 0.00 -0.42 -2.17 0.00
## sulphates 0.00 -0.50 -2.32 0.00
## alcohol -0.01 -3.81 -17.91 0.00
## quality -0.01 0.86 -2.04 0.00
## pH sulphates alcohol quality
## fixed.acidity -0.05 0.06 -0.15 -0.09
## volatile.acidity 0.01 0.01 -0.01 -0.04
## citric.acid -0.01 0.00 0.00 0.01
## residual.sugar -0.20 -0.13 -2.04 -0.15
## chlorides 0.00 0.00 -0.01 -0.01
## free.sulfur.dioxide -0.42 -0.50 -3.81 0.86
## total.sulfur.dioxide -2.17 -2.32 -17.91 -2.04
## density 0.00 0.00 0.00 0.00
## pH 0.03 0.00 0.02 0.00
## sulphates 0.00 0.02 0.00 0.01
## alcohol 0.02 0.00 1.42 0.46
## quality 0.00 0.01 0.46 0.76
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 3.03 -0.08 0.23 0.28
## volatile.acidity -0.08 0.03 -0.02 0.00
## citric.acid 0.23 -0.02 0.04 0.04
## residual.sugar 0.28 0.00 0.04 1.99
## chlorides 0.01 0.00 0.00 0.00
## free.sulfur.dioxide -2.80 -0.02 -0.12 2.76
## total.sulfur.dioxide -6.48 0.45 0.23 9.42
## density 0.00 0.00 0.00 0.00
## pH -0.18 0.01 -0.02 -0.02
## sulphates 0.05 -0.01 0.01 0.00
## alcohol -0.11 -0.04 0.02 0.06
## quality 0.17 -0.06 0.04 0.02
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity 0.01 -2.80 -6.48 0
## volatile.acidity 0.00 -0.02 0.45 0
## citric.acid 0.00 -0.12 0.23 0
## residual.sugar 0.00 2.76 9.42 0
## chlorides 0.00 0.00 0.07 0
## free.sulfur.dioxide 0.00 109.41 229.74 0
## total.sulfur.dioxide 0.07 229.74 1082.10 0
## density 0.00 0.00 0.00 0
## pH 0.00 0.11 -0.34 0
## sulphates 0.00 0.09 0.24 0
## alcohol -0.01 -0.77 -7.21 0
## quality 0.00 -0.43 -4.92 0
## pH sulphates alcohol quality
## fixed.acidity -0.18 0.05 -0.11 0.17
## volatile.acidity 0.01 -0.01 -0.04 -0.06
## citric.acid -0.02 0.01 0.02 0.04
## residual.sugar -0.02 0.00 0.06 0.02
## chlorides 0.00 0.00 -0.01 0.00
## free.sulfur.dioxide 0.11 0.09 -0.77 -0.43
## total.sulfur.dioxide -0.34 0.24 -7.21 -4.92
## density 0.00 0.00 0.00 0.00
## pH 0.02 -0.01 0.03 -0.01
## sulphates -0.01 0.03 0.02 0.03
## alcohol 0.03 0.02 1.14 0.41
## quality -0.01 0.03 0.41 0.65
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 0.71 0.00 0.03 0.38
## volatile.acidity 0.00 0.01 0.00 0.03
## citric.acid 0.03 0.00 0.01 0.06
## residual.sugar 0.38 0.03 0.06 25.73
## chlorides 0.00 0.00 0.00 0.01
## free.sulfur.dioxide -0.71 -0.17 0.19 25.80
## total.sulfur.dioxide 3.27 0.38 0.62 86.53
## density 0.00 0.00 0.00 0.01
## pH -0.05 0.00 0.00 -0.15
## sulphates 0.00 0.00 0.00 -0.02
## alcohol -0.13 0.01 -0.01 -2.81
## quality -0.08 -0.02 0.00 -0.44
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity 0.00 -0.71 3.27 0.00
## volatile.acidity 0.00 -0.17 0.38 0.00
## citric.acid 0.00 0.19 0.62 0.00
## residual.sugar 0.01 25.80 86.53 0.01
## chlorides 0.00 0.04 0.18 0.00
## free.sulfur.dioxide 0.04 289.24 444.87 0.01
## total.sulfur.dioxide 0.18 444.87 1806.09 0.07
## density 0.00 0.01 0.07 0.00
## pH 0.00 0.00 0.01 0.00
## sulphates 0.00 0.11 0.65 0.00
## alcohol -0.01 -5.23 -23.48 0.00
## quality 0.00 0.12 -6.58 0.00
## pH sulphates alcohol quality
## fixed.acidity -0.05 0.00 -0.13 -0.08
## volatile.acidity 0.00 0.00 0.01 -0.02
## citric.acid 0.00 0.00 -0.01 0.00
## residual.sugar -0.15 -0.02 -2.81 -0.44
## chlorides 0.00 0.00 -0.01 0.00
## free.sulfur.dioxide 0.00 0.11 -5.23 0.12
## total.sulfur.dioxide 0.01 0.65 -23.48 -6.58
## density 0.00 0.00 0.00 0.00
## pH 0.02 0.00 0.02 0.01
## sulphates 0.00 0.01 0.00 0.01
## alcohol 0.02 0.00 1.51 0.47
## quality 0.01 0.01 0.47 0.78
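One caveat when reading the matrices above: because covariance is scale-dependent, a near-zero entry can reflect either a weak relationship or simply a variable measured on a tiny scale (density, for instance). The unit-free correlation matrix avoids this ambiguity. A sketch on the built-in `mtcars` data:

```r
vars <- mtcars[, c("mpg", "disp", "hp")]
cov(vars)  # entries depend on the units of each variable
cor(vars)  # unit-free, always in [-1, 1]
```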
Below we can find the correlation plots for every pair of variables for each subgroup in the data set. Observations worth mentioning:
Figure 3.1: Correlation Plots by Wine Subgroup
We’ll use the Minimum Covariance Determinant (MCD) estimators to analyze the effect of outliers in our data. This analysis must be performed separately for each type of wine.
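A minimal sketch of the MCD estimator, using `MASS::cov.mcd` (MASS ships with R; the report may well have used another implementation such as `robustbase::covMcd`) on simulated data with a few planted outliers:

```r
library(MASS)  # cov.mcd() implements the Minimum Covariance Determinant estimator

set.seed(1)
x <- matrix(rnorm(200 * 3), ncol = 3)  # clean standard-normal data
x[1:10, ] <- x[1:10, ] + 8             # plant 10 gross outliers
mcd <- MASS::cov.mcd(x)                # robust center and covariance
mcd$center                             # stays close to 0 despite the outliers
```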
Comparing the 11 eigenvalues of the covariance matrix of the full red wine subgroup against those of the MCD matrix, we can see that they shrink when using only the most central observations.
| Eigenvalue | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Red Wine (full sample) | 1133.807 | 57.93541 | 3.101302 | 1.819415 | 1.0463404 | 0.0413967 | 0.0231927 | 0.0113465 | 0.0100780 | 0.0014550 | 6e-07 |
| Red Wine (MCD) | 624.183 | 40.58744 | 3.314625 | 1.094648 | 0.1904331 | 0.0418075 | 0.0150331 | 0.0100670 | 0.0076606 | 0.0002149 | 4e-07 |
Figure 3.2: Red Wine Data Set and MCD Eigenvalue comparison
Now we compare the correlations between the variables using only the heaviest-weighted observations against those from the full red wine data. Here we see that some of the correlations increase, meaning the outliers were diminishing these relationships. That is the case for the fixed.acidity & density, chlorides & density and sulphates & alcohol relationships.
Figure 3.3: Correlation comparison within the red wine subgroup
Using the 99th percentile of the Chi-square distribution with 11 degrees of freedom as the threshold, we classify 344 observations in the red wine subgroup as outliers. This represents 21.51% of the red wine observations.
Figure 3.4: Outliers by Mahalanobis Distance
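The outlier rule itself is easy to sketch in base R: compute squared Mahalanobis distances and compare them against the 99th percentile of a Chi-square with 11 degrees of freedom. With truly Gaussian data only about 1% of the observations gets flagged, which is why the 21.51% observed above is so striking (simulated data stands in for the red wine variables):

```r
set.seed(42)
n <- 1000; p <- 11
x <- matrix(rnorm(n * p), ncol = p)        # stand-in for the 11 numeric variables
d2 <- mahalanobis(x, colMeans(x), cov(x))  # squared Mahalanobis distances
cutoff <- qchisq(0.99, df = p)             # 99th percentile of chi-square(11)
outliers <- d2 > cutoff
mean(outliers)                             # close to 0.01 for Gaussian data
```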
Now we visualize the same plot as in the previous chapter, but this time with the outliers colored red. We notice that these outliers are the points at the edge of the point cloud in each pairwise relationship, while the “good” data sit in the middle of the group; we could describe those central observations as the most similar to one another.
Using the PCP and the Andrews plot we see a more distinct behavior for the outliers. In the PCP, the observations with extremely high values (especially in the chlorides and residual.sugar variables) are the ones identified as outliers. In the Andrews plot, the outliers (blue lines) are the curves at the extremes of the group, be it on the high or the low side.
Using the same methodology as above, we use the MCD to find better estimates of the parameters of our data. In this case, the heaviest-weighted data does not clearly improve our estimate of the covariance, as seen below in the comparison of the eigenvalues of each matrix.
| Eigenvalue | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| White Wine (full sample) | 1931.513 | 168.4529 | 21.56099 | 1.07442 | 0.6867086 | 0.0185319 | 0.0142898 | 0.0114461 | 0.0086420 | 0.0003961 | 3e-07 |
| White Wine (MCD) | 2060.696 | 145.1536 | 23.06863 | 1.10776 | 0.6676802 | 0.0187713 | 0.0114369 | 0.0083450 | 0.0053658 | 0.0000749 | 2e-07 |
Figure 3.5: White Wine Data Set and MCD Eigenvalue comparison
Even so, some relationships do become stronger in this subset, as is the case for the correlations between chlorides & alcohol and residual.sugar & chlorides.
Figure 3.6: Correlation comparison within the white wine subgroup
Again using the 99th percentile of the Chi-square distribution with 11 degrees of freedom as the threshold, we classify 536 observations in the white wine subgroup as outliers. This represents 10.94% of the white wine observations.
Figure 3.7: White Wine Outliers by Mahalanobis Distance
Now we visualize the behavior of the outliers for every pair of variables. The most interesting case involves the chlorides variable: the “good” data form a very concentrated group on the left side of the plot, while the outliers (red points) are dispersed to the right. A similar pattern appears along the volatile.acidity variable.
Figure 3.8: Scatter Plot Matrix of white wine variables (with outliers)
Finally, we examine the differences by group using the PCP and Andrews plot. The PCP confirms the observation made on the previous plot: a clear indicator of outlier observations is high values of chlorides and volatile.acidity. In the Andrews plot we notice the same behavior as with the red wine outliers: the outliers are the curves at the extremes of the group.
Figure 3.9: PCP of White Wine (with outliers)
Figure 3.10: Andrews Plot of White Wine (with outliers)
In both subgroups we classified as outliers those observations with a squared Mahalanobis distance larger than the 99th percentile of the Chi-square distribution with 11 degrees of freedom. The number of outliers this rule flagged was extremely high in both cases, which might indicate that our squared distances do not actually follow a Chi-square distribution, i.e. that the data are not multivariate normal.
For the sake of this project, we’ll still remove these observations from the analysis.
Now we’ll perform a principal component analysis (PCA) to reduce the dimensionality of our data.
From this analysis we get 12 principal components, mutually uncorrelated, each explaining a certain percentage of the variability in our data. That is the information shown in the left-hand plot below. The right-hand plot shows the cumulative variability explained as dimensions are added. Based on it, we decided to keep only the first 4 PCs, which explain 78% of the variability. So this analysis reduced our dimensions from 12 to 4.
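The per-component and cumulative variance figures come straight from the PCA eigenvalues. A sketch with `prcomp`, using the built-in `USArrests` data as a stand-in (the variables are standardized first, since, as with the wine data, they live on very different scales):

```r
pca <- prcomp(USArrests, scale. = TRUE)        # standardize, then rotate
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # share of variance per component
cumsum(var_explained)                          # cumulative variability explained
```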
In the plot below we visualize the loading of each variable on the four dimensions selected previously. We can extract the following conclusions from it:
(This grouping could be interpreted as specific characteristics of the wine. Such as taste, texture, etc. But for that, one must have a certain chemical and winery expertise.)
Figure 4.1: Variable Loading for each Principal Component
If we plot each PC against the others and color each observation by its type of wine, we can see whether any component actually helps distinguish this qualitative variable. In the scatter plot below, the clearest separation always appears in comparisons involving Dim1, which we could interpret as Dim1 mostly explaining the difference between the two types of wine.
Figure 4.2: Scatter Plot of Principal Components by Type of Wine
Finally, we can look at the correlation between each component and the original variables in the data set. Interestingly, after the 4th component there doesn’t seem to be any relevant correlation, reinforcing our decision to keep only the first four PCs. When we compare these correlation values against the loadings analyzed previously, we find that all loadings above the established threshold (\(\sqrt{1/p}\)) show a strong direct or inverse correlation (\(\lvert x_i \rvert \ge 0.5\)) with their principal component.
Figure 4.3: Correlation Plot between PCs and Original Variables
Lastly, we’ll use clustering algorithms to find hidden groups inside our data.
First, we determined how many groups there might be. For this we used 3 approaches:
Figure 5.1: Optimal number of clusters with different methods
From these 3 methods we get different conclusions:
With a vote of 2 out of 3, we decided to move forward with k = 3.
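One of the usual approaches, the elbow method, can be sketched with base `kmeans`: the total within-cluster sum of squares drops sharply up to the true number of clusters and flattens afterwards (simulated three-cluster data here, not the wine data):

```r
set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))  # three synthetic clusters

wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
wss  # the "elbow" (flattening) should appear around k = 3
```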
Now we try 3 different clustering algorithms, setting the number of centers to 3.
We also try a second implementation of PAM: instead of building it from the quantitative data alone (all of these methods work only with quantitative values), we compute the Gower distance for the whole data set (including the categorical variable) and use that matrix as the input for the PAM algorithm.
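A minimal sketch of that second implementation with the `cluster` package (which ships with R): `daisy` computes the Gower dissimilarity for mixed-type data and `pam` accepts the precomputed matrix. The toy data frame below is hypothetical, standing in for the wine data set:

```r
library(cluster)  # daisy() and pam()

set.seed(13)
df <- data.frame(type    = factor(rep(c("red", "white"), each = 25)),
                 alcohol = c(rnorm(25, 10.4), rnorm(25, 10.5)),
                 sugar   = c(rnorm(25, 2.5),  rnorm(25, 6.4)))

g   <- daisy(df, metric = "gower")  # mixed-type dissimilarity, entries in [0, 1]
fit <- pam(g, k = 3)                # PAM on the precomputed dissimilarities
table(fit$clustering)
```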
We visualize the result of these algorithms by plotting the observations on the first 2 principal components, which explain 54% of the variability in the data.
Figure 5.2: Clustering Algorithms with K = 3
From the results of the clustering algorithms we conclude that the PAM implementation using the Gower distance produces the best grouping (at least from the perspective of these PCs), although that is not a completely fair comparison, given that it has more information than the other algorithms. The remaining methods produce seemingly the same result.
We can compare the performance of these algorithms using their average silhouette width: the closer this value is to 1, the better the grouping. By this metric, PAM with the Gower distance has the lowest score of all; its apparently better performance in the previous plot might be an artifact of viewing the data only through the first 2 principal components. The remaining methods share the same average silhouette width of 0.51, meaning all of them did a reasonably good job clustering.
If we had to choose only one method based on this metric, it would be K-Means with k = 3. It has the same average silhouette width as the other 2 methods, but it misclassified fewer observations than PAM and CLARA, as seen in the smaller negative tail on the left side of each cluster’s silhouette.
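The average silhouette width used for this comparison can be computed with `cluster::silhouette`. A sketch on simulated, well-separated data (where the score should come out high):

```r
library(cluster)

set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 5),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

km  <- kmeans(x, centers = 3, nstart = 10)
sil <- silhouette(km$cluster, dist(x))  # per-observation silhouette widths
mean(sil[, "sil_width"])                # average silhouette width
```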
Now we’ll use hierarchical clustering algorithms to group our data. The difference with these algorithms is that they do not require fixing the number of groups in advance. They work by merging smaller groups into larger ones, or dividing larger groups into smaller ones of similar data, creating a hierarchy of clusters represented by dendrograms.
These methods take the bottom-up approach, merging small groups (initially, each observation) into larger ones. We’ll compare 4 agglomerative clustering algorithms: Single Linkage, Complete Linkage, Average Linkage and Ward Linkage. The difference between these algorithms is the criterion each one uses to merge clusters.
| Method | Average Silhouette Width |
|---|---|
| Single Linkage | 0.151 |
| Complete Linkage | 0.380 |
| Average Linkage | 0.455 |
| Ward Linkage | 0.432 |
Based on the average silhouette width, the average linkage method scored highest on our data (0.455), with Ward linkage close behind (0.432). Interestingly, single linkage essentially found only one big cluster.
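The four linkages compared above differ only in the `method` argument of base R’s `hclust`. A sketch on simulated two-group data, cutting the Ward dendrogram into two clusters:

```r
set.seed(7)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))
d <- dist(x)

hc_single <- hclust(d, method = "single")   # chaining-prone, often one big cluster
hc_ward   <- hclust(d, method = "ward.D2")  # merges to minimize within-cluster variance
groups    <- cutree(hc_ward, k = 2)         # cut the dendrogram into 2 clusters
table(groups)
```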
Divisive algorithms take a top-down approach to clustering: they start with one big cluster, and at each step an existing cluster is split into two. The most popular algorithm in this family is DIvisive ANAlysis Clustering (DIANA).
The average silhouette width of this method was 0.38, so it did not outperform the average and Ward linkage algorithms.
In the end, the best clustering algorithm was K-Means, with an average silhouette width of 0.51.